I want to share a take on grammar-constrained generation I've been working on for a while. This post is a preliminary writeup: enough detail to communicate the core read-GLR-stack-with-a-weighted-automata idea (and timestamp it). I’ll likely follow up with a post on the various tricks you need to actually build the damn things without exploding.
You have a grammar (often a JSON Schema) and you want the LLM to produce grammatically valid output.
At each step, the LLM outputs logits over the vocabulary . You sample a token. Append it to the output. Repeat.
Constrained decoding is: don’t sample tokens that would make the output grammatically invalid.
So we maintain a constraint state and, for each step, compute a mask of allowed tokens, and apply it to the logits.
No more broken JSON.
Think of constrained decoding as two functions:
get_mask(state) returns a bitmask over the LLM vocabulary.commit(state, token) updates the parser state with the chosen token.At runtime the loop looks like:
loop:
parallel:
GPU: logits
CPU: mask
apply mask
sample token
commit token to model and constraint
LLM inference is batched: multiple user requests run on the GPU at once. And mask generation sits directly on the critical decoding path.
If you want to generate 1k tokens per second, you have 1 millisecond to produce a mask. And if you miss even a single one - if a mask takes 2ms instead of 1 - it screws up scheduling; the GPU sits waiting on the straggler mask, and suddenly your p99 latency is trash.
Masks can't be late; they have to arrive on time every time. It's worst-case performance that matters.
It’s not “can you do masking fast on average?” It’s “can you do it fast every single step?”
The commit op is analogous to incremental parsing: you’re incrementally parsing a growing prefix.
Many battle-tested incremental parsing systems like tree-sitter and Lezer use Generalized LR (GLR) parsing with a graph-structured stack (GSS).
Glossary
--.Now you have a parse state (often a GSS), and you want a mask over LLM tokens.
A straightforward approach is:
For each LLM token , check whether appending it keeps the parse valid.
But you don’t append an LLM token to the parser. You append terminal symbols.
And a single LLM token is a byte string which might:
"}\n" could lex as RBRACE NEWLINE),true),So “try token ” is really “try every way ’s bytes could lex into some terminal sequence, and then try advancing the parser for that terminal sequence.”
A common direction is:
This can work well for many grammars and workloads. People implement this and get decent average performance, especially for “simple” grammars.
But for general CFG-ish constraints (especially when you care about worst-case), you run into two issues:
Even if your lexical side is deterministic (say you enforce longest-match, or you have a DFA lexer), a single grammar terminal shift can trigger a chain of reductions.
In an LR parser, “shift terminal ” is often really:
Those reduction chains are table-driven and fast in the happy path, but in GLR they can fan out. The operations are pointer-heavy and cache-unfriendly: you’re manipulating a GSS, merging nodes, pruning branches, etc.
If you do that inside masking—i.e. while traversing a trie of possible tokens—you’re effectively building and mutating a GSS for a hypothetical next token, and doing it potentially thousands of times per step.
That’s exactly the kind of work you don’t want on the critical path.
Even with tries, there are some tokens whose byte strings correspond to exceedingly long sequences of terminals. Sometimes these get quickly pruned, e.g. ????. But not always.
Disambiguation
MINUS, NUMBER, IDENT. This is what your parser consumes.For example, Python will happily accept monstrosities like:
The sixteen hyphens get tokenised (BPE) as a single LLM token:
But token 7535 lexes to 16 MINUS terminals:
Every time a mask is generated, the parser must ingest all 16 MINUS terminals of token 7535. This forces it to do a lot of terminal-level processing—16 shifts through the grammar, each of which may involve many reductions—triggered by a single LLM token.
You can transform the grammar to remove unit/null reductions. You can aggressively merge and simplify GSS nodes. You can add clever memoization, early exits. But in my experience, these pesky LLM tokens with complicated lexes are just really hard to engineer away.
I tried hard to make the “trie + incremental parse simulation” approach behave well in worst-case latency terms. In my experience, it’s a dead end if you’re aiming for predictable sub-millisecond masking on arbitrary inputs/grammars.
Token validity depends on what's on the stack. Instead of asking "does LLM token t work on this stack?" for each token, ask "given this stack, which LLM tokens work?"
That's the reframing. Now we need a data structure that turns "stack → allowed tokens" into something we can execute quickly.
That’s where the weighted automaton comes in.
Think of a finite automaton, except each transition carries a weight, and traversing paths accumulates weights.
In our weighted automata:
The automaton reads (a representation of) your current parse configuration:
A token is valid if it’s valid on any parse path in the GSS, so on a GSS you just union the results across paths.
Another way to see it (which helped me):
That set of stacks is a regular language over parser state IDs (this is closely related to the classical “viable prefixes are regular” result from LR theory).
So you can imagine an automaton that recognizes “stacks on which token is valid.”
Of course, if you do that for every , you’d have 200k automata. That’s useless at runtime.
A weighted automaton is how you smash those 200k membership tests into one run:
Same computation, but 'vectorized' across tokens via bitsets (although I find rangesets perform better than bit vectors, so maybe 'vectorized' isn't quite accurate).
In practice you rarely need the full stack. As you scan states, each token’s fate becomes fixed: it either already reaches an accepting state (so it will stay valid) or it has been filtered out forever. Once a token is decided, pushing it deeper won’t change anything.
So maintain a “decided” set as you go. Peel those tokens off the frontier. When everything left is decided, you can stop immediately—no need to read the rest of the stack.
For a single LR stack, the runtime looks like:
def get_mask(stack_state_ids_top_to_bottom):
# frontier: map automaton_state -> bitset(tokens)
frontier = { A.start: ALL_TOKENS }
for sid in stack_state_ids_top_to_bottom:
new = {}
for a_state, tokens in frontier.items():
for (a2, weight) in A.step(a_state, sid):
tokens2 = tokens & weight # ∩ (filter)
if tokens2.any():
new[a2] = new.get(a2, EMPTY) | tokens2 # ∪ (merge)
frontier = new
if decided(frontier):
break
return combine_accepting(frontier)
This is the key operational point:
For a GSS, you do basically the same computation, but over a graph rather than a single list. Conceptually:
A token is valid if it’s valid on any stack path, so whenever GSS paths merge you union their token sets, and whenever automaton paths merge you union there too.
The important part is: it’s still the same “bitset flows through a graph via ∩ and ∪” pattern. No backtracking, no “try token ” loop.
(Implementation detail: the GSS nodes themselves need to carry the propagated token sets, not just the edges. When nodes merge, their associated data must merge too. Getting this efficient requires careful GSS design, but that's a topic for another post.)
Up to now I've treated the weighted automaton as a black box.
Time to open it. The automaton is precomputed from the grammar.
At a high level, you want an automaton that answers:
given the current lexer+parser state (represented by the stack/GSS), which LLM tokens could lead to a valid continuation?
There are two distinct problems mixed together:
The thing we compile is basically a composition of two automata:
An LLM token is a byte string.
A grammar tokenizer/lexer consumes a stream of bytes and emits grammar terminals. Crucially:
So the mapping “token → terminal sequence” is not a single fixed lookup. It depends on the current lexer state.
The Terminal DWA is the precomputed structure that answers:
from lexer configuration , which LLM tokens could produce which sequences of terminals?
A practical way to build it is:
Then determinize. Because it's built over a finite trie of token bytes, the Terminal DWA is acyclic: you only move forward through token bytes, never loop. The key thing the Terminal DWA buys you is: it collapses “iterate over 200k tokens and run the lexer” into “traverse a small automaton state space and get bitsets.”
Now suppose the lexer says “the next terminal is .” What does the LR parser do?
You can precompute, for each terminal , an automaton that reads stack state IDs (top down) and encodes all legal reduction sequences that could precede a shift of . These are closely related to the "template automata" described by Aycock et al. in Even Faster Generalized LR Parsing (Acta Informatica, 2001).
Internally I represent stack effects using the polycyclic monoid: an algebra of stack actions where you have "push state " and "pop state " as generators. Adjacent push-pop pairs cancel when they match, and mismatches annihilate the whole path. This makes composition natural—stack effects compose by concatenation plus cancellation—and lets you prune impossible paths early.
Aycock et al. show that if you normalize the grammar to eliminate right recursion and hidden left recursion, the reduction chains become bounded. This means the template automata can be made acyclic.
This is the piece that takes “pointer-heavy GLR reduction simulation” off the runtime path and turns it into precomputed transitions.
Finally, you compose:
to get one deterministic weighted automaton that:
This is the automaton you run in get_mask.
Since the Terminal DWA is acyclic and the template automata can be made acyclic, their composition—the Parser DWA—is also acyclic. This bounds how deep you ever need to read into the stack.
The key to making composition tractable is the cancellation property of the polycyclic monoid: when you concatenate stack effects, matching push-pop pairs cancel, and mismatched sequences annihilate. This is what keeps the composed automaton from exploding during determinization.
Note that get_mask only needs to determine validity—it doesn't need to compute the actual stack transformation. The real lexer and parser handle that in commit. So once the automaton has filtered a token (accepted or rejected), it can forget the details of which states would be pushed. Only the validity verdict matters.
You might ask “do you really scan the entire stack every time?”
No. In practice you only need a bounded suffix, and you can make that precise.
Two related reasons:
The term “token” is overloaded:
IDENT, "{", NUMBER, keywords, etc.).These tokenizations don’t align.
Even if your lexer uses longest-match, longest-match is inherently forward-looking: you can’t know a match is final until you see what comes next.
Classic example:
"+" followed by "+" should yield INCREMENT ("++"), not two PLUS.In a streaming setting, after you see the first "+", you’re in a state where:
PLUS,INCREMENT.So you have to represent that uncertainty somehow.
A standard trick in incremental lexing is what I'll call inhibited terminals (the terminology varies, but the idea is common):
This plays nicely with GLR+GSS, because “fork and prune” is already the parser’s native move.
Practically, the constraint state you carry around is not just “parser stack(s)”; it’s “(parser stack(s), lexer state(s))”. get_mask needs to condition on both, which is why the Parser DWA has multiple initial states (one per active lexer state).
commit(state, token) uses GLR machinery:
get_mask(state) uses the precompiled weighted automaton:
So you get a split where:
commit is responsible for advancing the parser state with a chosen LLM tokenget_mask is responsible for cheaply answering “what LLM tokens could be next?”Mask computation becomes:
Critically:
Work is proportional to stack structure, not the grammar complexity or vocabulary size.
I’m deliberately avoiding implying there's a “this is optimal” theorem. That'd be misleading. Parsing highly ambiguous CFGs isn’t a domain where you get many satisfying worst-case optimality results.
But it feels close to optimal for two reasons:
For incremental CFG parsing with ambiguity, GLR is a pretty hard baseline to beat. There’s a reason people working on these problems keep converging on GLR-like techniques:
You still inherit GLR’s lack of comforting worst-case bounds in “fully adversarial ambiguity” settings. In theory, GLR can degrade badly on highly ambiguous grammars/inputs. In practice, for JSON schemas and “programming-language-ish” grammars with reasonable disambiguation, it behaves well.
The weighted-automaton mask computation does the minimum work you’d reasonably hope for:
In other words: the runtime work scales with the size of the parse configuration, not with vocabulary size and not with “how gnarly are the tokens.”
The cost shifts to compile time.
Precompiling the grammar into this Parser DWA involves determinization and simplification in a bitset weights setting (union at merges, intersection along paths). If you're not careful, large grammars can blow up in memory/time.
Getting compile-time and memory to behave took most of my engineering effort:
That’s probably the next post.
If you're building constrained decoding and you care about p99.9 latency: don't put real parsing work in get_mask. Keep parsing in commit, and make get_mask a fast precomputed filter. The compile-time cost is worth the runtime predictability.